{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Quantizing 🤗 Transformers models with the GPTQ method can be done in a only few lines.\n",
    "\n",
    "📌 Note that for large model like 175B param, at least 4 GPU-hours will be needed if one uses a large dataset (e.g. `\"c4\"``).\n",
    "\n",
    "📌 Of course many GPTQ models are already available on the Hugging Face Hub, which bypasses the need to quantize a model yourself in most use cases. Nevertheless, you can also quantize a model using your own dataset appropriate for the particular domain you are working on."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Quantizing 🤗 Transformers models with the GPTQ method\n",
    "\n",
    "from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig\n",
    "\n",
    "model_id = \"meta-llama/Llama-2-7b-hf\"\n",
    "tokenizer = AutoTokenizer.from_pretrained(model_id)\n",
    "\n",
    "quantization_config = GPTQConfig(bits=4, dataset = \"c4\", tokenizer=tokenizer)\n",
    "\n",
    "model = AutoModelForCausalLM.from_pretrained(model_id, device_map=\"auto\",\n",
    "                                    quantization_config=quantization_config)\n",
    "\n",
    "model.push_to_hub(\"username/Llama-2-7b-gptq-4bit\")\n",
    "tokenizer.push_to_hub(\"username/Llama-2-7b-gptq-4bit\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`bits` param: Its the number of bits to quantize to, supported numbers are (2, 3, 4, 8).\n",
    "\n",
    "`dataset`: The dataset used for calibration. I would leave “c4“ which seems to yield reasonable results. Other datasets are supported according to the documentation.\n",
    "\n",
    "`tokenizer`: The tokenizer of Llama 2 7B that will be applied to c4.\n",
    "\n",
    "- `desc_act` : Whether to quantize columns in order of decreasing activation size. Setting it to False can significantly speed up inference but the perplexity may become slightly worse. If inference speed is not your concern, you should set `desc_act` to True.\n",
    "\n",
    "- `use_exllama` - bool, optional — Whether to use exllama backend. Defaults to True if unset. Only works with bits = 4. For 4-bit model, you can use the exllama kernels in order to a faster inference speed.\n",
    "\n",
    "You need to have the entire model on gpus if you want to `use_exllama` kernels. So if you plan to use the model on a configuration with a small VRAM that will split the model to multiple devices with `device_map`, you should set `use_exllama` to False."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "py10env",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "name": "python",
   "version": "3.10.13"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
